
fix: add read deadline to tls write #3283

Merged

dnwe merged 1 commit into IBM:main from bvalente:tls-deadline on Sep 15, 2025
Conversation

@bvalente
Contributor

@bvalente bvalente commented Sep 8, 2025

Related to:
- golang/go#13828
- IBM#1722
We're using https://github.com/Mongey/terraform-provider-kafka to manage Kafka Topics with Terraform. Recently we've changed from Plaintext communications to AWS IAM Authentication. When doing so, our provider sometimes would hang indefinitely on some plans. We pinned this to the kafka.t3.small cluster tiers, as these have several limitations, including a maximum of 4 TCP connections per second.

While debugging the provider, we saw that the call stack was stuck on a write to the cluster, specifically on the first communication it attempted with the cluster. Reading through the code, we found a very interesting comment on the Write function of the crypto/tls package.

https://github.com/golang/go/blob/go1.23.0/src/crypto/tls/conn.go#L1192-L1195

```
// As Write calls [Conn.Handshake], in order to prevent indefinite blocking a deadline
// must be set for both [Conn.Read] and Write before Write is called when the handshake
// has not yet completed. See [Conn.SetDeadline], [Conn.SetReadDeadline], and
// [Conn.SetWriteDeadline].
```

Based on this, TLS requires both write and read deadlines to be set because the Write function may perform a handshake on the first communication, and the handshake both writes and reads.

I believe that in our case, since we are working with brokers that don't have a very reliable network, sometimes the handshake would not progress on the server side, and we would indefinitely wait for a Read that would never come.

After implementing this change on our local workstation, instead of hanging indefinitely, the program finally reported an error:

```
Error: kafka: client has run out of available brokers to talk to: read tcp 10.xxx.xxx.xxx:59582->10.xxx.xxx.xxx:9098: i/o timeout
```

Collaborator

@puellanivis puellanivis left a comment

Only thing I could think is to join the time.Now() calls into a common local variable, so they’re both based on the same “now”.

But yeah, good change all over otherwise. 👍

Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>
@bvalente
Contributor Author

bvalente commented Sep 9, 2025

@puellanivis thank you for the review

I addressed your comment and force-pushed after rebasing on master

Collaborator

@puellanivis puellanivis left a comment

Looks great. :)

@bvalente
Contributor Author

Hello @puellanivis, what would be the process to get this merged and tagged? Is there a timeline, or anything I can do from our side? 🙂

@puellanivis
Collaborator

Sometimes reviews from IBM can take a while. I don’t actually have any ability to even approve in my code review, let alone merge anything. I’m just a third-party F/OSS contributor helping out with code reviews.

Collaborator

@dnwe dnwe left a comment

@bvalente thanks! this was a good catch

@dnwe dnwe added the fix label Sep 15, 2025
@dnwe dnwe merged commit 25368c4 into IBM:main Sep 15, 2025
17 checks passed
3AceShowHand pushed a commit to 3AceShowHand/sarama that referenced this pull request Apr 17, 2026
Related to:
- golang/go#13828
- IBM#1722

We're using https://github.com/Mongey/terraform-provider-kafka to manage
Kafka Topics with Terraform. Recently we've changed from Plaintext
communications to AWS IAM Authentication. When doing so, our provider
sometimes would hang indefinitely on some plans. We pinned this to the
`kafka.t3.small` cluster tiers, as these have several limitations,
including a maximum of 4 TCP connections per second.

While debugging the provider, we saw that the call stack was stuck on a
write to the cluster, specifically on the first communication it
attempted with the cluster. Reading through the code, we found a very
interesting comment on the Write function of the crypto/tls package.


https://github.com/golang/go/blob/go1.23.0/src/crypto/tls/conn.go#L1192-L1195
```
// As Write calls [Conn.Handshake], in order to prevent indefinite blocking a deadline
// must be set for both [Conn.Read] and Write before Write is called when the handshake
// has not yet completed. See [Conn.SetDeadline], [Conn.SetReadDeadline], and
// [Conn.SetWriteDeadline].
```
Based on this, TLS requires both write and read deadlines to be set
because the Write function may perform a handshake on the first
communication, and the handshake both writes and reads.

I believe that in our case, since we are working with brokers that don't
have a very reliable network, sometimes the handshake would not progress
on the server side, and we would indefinitely wait for a Read that would
never come.

After implementing this change on our local workstation, instead of
hanging indefinitely, the program finally reported an error:
```
Error: kafka: client has run out of available brokers to talk to: read tcp 10.xxx.xxx.xxx:59582->10.xxx.xxx.xxx:9098: i/o timeout
```

Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>
3AceShowHand added a commit to pingcap/sarama that referenced this pull request Apr 17, 2026
* Fix data race on Broker.done channel (IBM#2698)

The underlying case was not waiting for the goroutine running the
`responseReceiver()` method to fully complete if SASL authentication
failed. This created a window where a further call to `Broker.Open()`
could overwrite the `Broker.done` channel value while the goroutine
still running `responseReceiver()` was trying to close the same channel.

Fixes: IBM#2382

Signed-off-by: Adrian Preston <PRESTONA@uk.ibm.com>

* fix: add read deadline to tls write (IBM#3283)

Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>

* fix(client): ignore empty Metadata responses when refreshing (IBM#2672)

We should skip the metadata refresh if a broker contacted during the startup phase returns an empty broker list in its metadata response. The Java client skips empty responses when updating its metadata cache (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L1149), and Sarama should have feature parity.

Fixes IBM#2664

Signed-off-by: Hao Sun <haos@uber.com>
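
The guard described above reduces to a small pattern; this is an illustrative sketch (`applyMetadata` and the plain string slice are assumptions, not Sarama's actual types): apply a response only when it carries brokers, otherwise keep the cached view intact.

```go
package main

import "fmt"

// applyMetadata updates the cached broker list only when the response is
// non-empty, mirroring the Java client's check. It reports whether the
// cache was updated.
func applyMetadata(cache *[]string, resp []string) bool {
	if len(resp) == 0 {
		return false // skip: an empty response must not wipe the cache
	}
	*cache = resp
	return true
}

func main() {
	cache := []string{"old-broker:9092"}
	fmt.Println(applyMetadata(&cache, nil), cache)                      // skipped, cache kept
	fmt.Println(applyMetadata(&cache, []string{"new-broker:9092"}), cache) // applied
}
```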

---------

Signed-off-by: Adrian Preston <PRESTONA@uk.ibm.com>
Signed-off-by: Bernardo Valente <bernardofvalente@gmail.com>
Signed-off-by: Hao Sun <haos@uber.com>
Co-authored-by: Adrian Preston <prestona@users.noreply.github.com>
Co-authored-by: bvalente <bernardofvalente@gmail.com>
Co-authored-by: HaoSunUber <86338940+HaoSunUber@users.noreply.github.com>